Skip to content

[SYSTEMDS-3902] Accelerated data transfer Python <--> JVM#2296

Closed
e-strauss wants to merge 1 commit into
apache:mainfrom
e-strauss:data_transfer_PR
Closed

[SYSTEMDS-3902] Accelerated data transfer Python <--> JVM#2296
e-strauss wants to merge 1 commit into
apache:mainfrom
e-strauss:data_transfer_PR

Conversation

@e-strauss

@e-strauss e-strauss commented Jul 22, 2025

Copy link
Copy Markdown
Contributor

Benchmark results

I did experiments on two different systems:

  • Mac w/ M2 and 16GB memory
  • Linux ThinkPad w/ Ryzen 5 and 64GB memory

Experiment 1: Python --> Java:

Experiment-Setup:

  • allocate np array in python (time excluded)
  • start systemds context w/ pipe setup (time included)
  • transfer the array to systemds 5x (time included)

Note:

  • for arrays, that are larger than ~152.588 MB, the py4j implementation had a runtime exception (allocation of an array with negative size)

Hardware: M2 w/ 16GB memory

Array Size py4j runtime in s unix pipe runtime in s
0.763 MB 0.186 0.103
3.815 MB 0.3071 0.119
7.629 MB 0.421 0.145
38.147 MB 1.357 0.332
76.294 MB 2.52 0.607
152.588 MB 7.096 1.073
381.470 MB ERR 2.39
0.745 GB ERR 4.555
1.49 GB ERR 13.4067

Hardware: Ryzen 5 w/ 64GB memory

For the spark experiment, I used reference implementation with pyspark and arrow. I created a single column arrow table with the numeric data, created a spark df from the table and and triggered the transfer via a simple count action.

Runtime in s

#elements Size py4j single unix pipe multi unix pipes mmap spark
100000 781.25 KB 0.337 0.119 0.197 0.148 1.196
500000 3.81 MB 0.646 0.161 0.261 0.215 3.496
1000000 7.63 MB 1.657 0.194 0.324 0.277 6.349
5000000 38.15 MB 3.793 0.454 0.560 0.798 36.516
10000000 76.29 MB 7.487 0.767 0.915 1.434 4.426
20000000 152.59 MB 14.030 1.426 1.574 2.673 5.617
50000000 381.47 MB ERR 3.397 3.579 6.275 12.467
100000000 762.94 MB ERR 6.682 7.354 12.909 25.100
200000000 1.49 GB ERR 11.801 12.470 27.549 47.890

@phaniarnab

Copy link
Copy Markdown
Contributor

@e-strauss. The speedups look good. Can you also mention the size of the inputs beside #elements? Size is easier to comprehend?
Wondering why we are not using Arrow format for JVM <-> Python data transfer like Spark does [1]?

[1] https://docs.databricks.com/aws/en/pandas/pyspark-pandas-conversion

@e-strauss

Copy link
Copy Markdown
Contributor Author

@phaniarnab afaik Apache Arrow gives us just an unified format, but not per-se transport way for the data transfer. In this case, we are transferring dense arrays, where we wouldn't gain anything from the Arrow format and have the overhead for the transformation from pandas to arrow and from arrow to our internal matrix format, I guess.

When it comes to transferring frames, we can use the arrow representation as format, and send it over our FIFO pipe

@codecov

codecov Bot commented Jul 23, 2025

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.65%. Comparing base (480e9a0) to head (25f7bad).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2296      +/-   ##
============================================
+ Coverage     72.61%   72.65%   +0.03%     
- Complexity    46267    46320      +53     
============================================
  Files          1490     1491       +1     
  Lines        174390   174590     +200     
  Branches      34210    34236      +26     
============================================
+ Hits         126634   126845     +211     
+ Misses        38194    38173      -21     
- Partials       9562     9572      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@e-strauss

Copy link
Copy Markdown
Contributor Author

@phaniarnab I added a new experiment for spark with arrow tables as comparison. For the experiment, I created an arrow table in python and transferred it to spark by triggering the computation using count action. The experiment can be found here.

For larger data sizes, the runtime goes down significantly, since spark switches from a LocalRelation in createDataFrame to a RDD-based createDataFrame. Both with Arrow optimization.

@e-strauss e-strauss force-pushed the data_transfer_PR branch 3 times, most recently from 0afe221 to 563b70b Compare July 30, 2025 12:37
@e-strauss e-strauss changed the title [WIP] Data transfer Python <--> Java [SYSTEMDS-3902] Accelerated data transfer Python <--> JVM Jul 30, 2025
@e-strauss e-strauss requested a review from Baunsgaard July 30, 2025 21:58

@Baunsgaard Baunsgaard left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread src/main/java/org/apache/sysds/runtime/util/UnixPipeUtils.java
@e-strauss e-strauss closed this in 264bdcd Jul 31, 2025
@github-project-automation github-project-automation Bot moved this from In Progress to Done in SystemDS PR Queue Jul 31, 2025
@e-strauss e-strauss deleted the data_transfer_PR branch July 31, 2025 17:25
j143 pushed a commit to j143/systemds that referenced this pull request Aug 9, 2025
Introduced a new data transfer mechanism on Unix systems using FIFO (named) pipes as a faster alternative to py4j-based communication.
- Supports multiple value types (uint8, int32, fp32, fp64) for dense matrix exchange.
- Adds experimental support for partitioned matrix transfer from Python to Java via multiple concurrent pipes (disabled by default due to limited performance improvement).
- Significantly reduces overhead compared to py4j for large matrix transfers in supported scenarios

Closes apache#2296.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants